ParFORM: recent development

M. Tentyukov a, J.A.M. Vermaseren b and H.M. Staudenmaier a,c

a Institut für Theoretische Teilchenphysik, Universität Karlsruhe, Germany

b NIKHEF, Amsterdam

c Interfak. Institut für Anwendungen der Informatik, Universität Karlsruhe, Germany

We report on the status of our project to parallelize the symbolic
manipulation program FORM. We now have parallel versions of FORM
running on cluster and SMP architectures. These versions can be used
to run arbitrary FORM programs in parallel.



1. General concept of the current version

FORM is a program for the symbolic manipulation of algebraic
expressions, specialized to handle very large expressions of millions
of terms in an efficient and reliable way. That is why it is widely
used in Quantum Field Theory, where calculations involving several
hundred (sometimes thousands of) Feynman diagrams are required.

In this context an improvement in computing efficiency is very
important. Parallelization is one of the most efficient ways to
increase performance, so the idea of parallelizing FORM is quite
natural.

ParFORM is the parallel version of FORM, developed in Karlsruhe since
1998. At present, a number of real physical applications exist which
were performed with the help of ParFORM [3].

Some internal mechanisms of FORM make it very well suited for
parallelization [2,4]. The concept of parallelization is indicated in
Fig. 1: distribute the input terms among the available processors,
let each of them perform local operations on its input terms, and
generate and sort the resulting output terms. At the end of a module
the sorted streams of terms from all processors have to be merged
into one final output stream again.
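The distribute, work, sort and merge cycle described above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not ParFORM code: a term is modelled as a hypothetical (monomial, coefficient) pair, the per-term `operation` is supplied by the caller, and the "processors" are run one after another in a single process.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def distribute(terms, n_workers):
    """Deal the input terms out round-robin, one chunk per worker."""
    return [terms[i::n_workers] for i in range(n_workers)]


def work(chunk, operation):
    """Apply the local operation to every term, then sort the output."""
    out = []
    for term in chunk:
        out.extend(operation(term))  # one input term may yield many terms
    out.sort(key=itemgetter(0))
    return out


def merge(streams):
    """Merge the sorted streams and add coefficients of equal monomials."""
    merged = heapq.merge(*streams, key=itemgetter(0))
    result = []
    for monomial, group in groupby(merged, key=itemgetter(0)):
        coefficient = sum(c for _, c in group)
        if coefficient != 0:
            result.append((monomial, coefficient))
    return result


def run_module(terms, operation, n_workers):
    """One 'module': distribute, work locally, merge (sequential stand-in)."""
    chunks = distribute(terms, n_workers)
    streams = [work(chunk, operation) for chunk in chunks]
    return merge(streams)
```

For example, `run_module([("a", 1), ("b", 2), ("a", 3)], lambda t: [(t[0], 2 * t[1])], 2)` doubles every coefficient and returns `[("a", 8), ("b", 4)]`.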

A master process initializes the distribution of terms and finally
collects the results. The real, time-consuming calculations, however,
are performed by the slaves. Each process is an independent stream of
commands operating on independent data. The master can communicate
with the slaves by means of different message passing libraries; we
use MPI (footnote 1).

Figure 1. General conception of ParFORM: a Master-Slave structure for
parallelization.

1 see http://www-unix.mcs.anl.gov/mpi/standard.html

The master simply distributes and collects data, so with a small
number of processors it is almost idle. In that case one can try to
force the master to participate in the real calculations, too. On the
other hand, with an increasing number of slaves the master spends
more and more time controlling them, which may lead to early speedup
saturation. Our estimates show that for more than four processors our
Master-Slave model is adequate. Since almost all real calculations
are performed by the slaves, we calculate speedups normalized to the
time spent by the program running on two processors, one master and
one slave.
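The normalization just described amounts to dividing the two-processor wall-clock time by the time measured on n processors. A minimal sketch follows; the function name and the dictionary layout are assumptions made for illustration, and no measured values from the paper are used.

```python
def normalized_speedup(times, baseline=2):
    """Speedup relative to the run on `baseline` processors.

    `times` maps the number of processors to the wall-clock time of
    the run; the default baseline of 2 corresponds to one master and
    one slave.
    """
    reference = times[baseline]
    return {n: reference / t for n, t in sorted(times.items())}
```

With purely hypothetical timings `{2: 100.0, 4: 40.0}` the function returns `{2: 1.0, 4: 2.5}`.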

Using a message passing library allows FORM to be parallelized on
both types of computer architecture, i.e. with shared memory (SMP,
footnote 2) and with distributed memory (clusters).

The results for the program BAICER (footnote 3) running on the SMP
machine SGI Altix 3700 Server with 32x 1.3 GHz/3MB-SC Itanium2 CPUs
are shown in Fig. 2. The speedup is almost linear up to 12
processors. Beyond that the speedup becomes nonlinear but is still
considerable.

The second architecture is a cluster. The results of running BAICER
on an IWR (footnote 4) Xeon cluster [5] are shown in Fig. 3.

The speedup curve still has a positive slope for more than 8
processors, but the slope is rather small, so the speedup is
reasonable only for a few nodes. For cluster computers such as [5] it
could be better to involve the master processor in the real
calculations too, but this should be studied in detail.

2. ParFORM on SMP 

The main disadvantage of the message passing approach is a
considerable overhead due to huge data transfers. On SMP computers
one can attempt to get rid of this overhead by using, e.g., threads
[7]. With this approach, however, several points have to be taken
into account:

2 Symmetric Multiprocessor 

3 All benchmarks mentioned in the paper were made by
running the same "standard" test example [1] obtained
from a package BAICER developed by P. Baikov following
methods described in

4 The Institute for Scientific Computing of the 
Forschungszentrum Karlsruhe. 
Figure 2. Computing time and speedup for the
test program BAICER on the SGI Altix 3700
server with 32x Itanium2 processors (1.3 GHz),
an SMP-type machine.

Within a typical SMP machine, all the memory is uniformly available
to each processor, which is called "Uniform Memory Access". All
memory accesses go through the same shared memory bus. This works
quite well for a relatively small number of CPUs. As the number of
CPUs increases, the shared bus becomes a problem because of the
growing rate of collisions between requests from multiple CPUs on
the single memory bus.

In order to avoid these scalability limits of 
SMP architectures, the "Non-Uniform Mem- 
ory Access" (NUMA) architecture was designed. 
NUMA assumes that each processor has its own 
local memory but it can also access memory 
owned by other processors. 

Figure 3. Computing time and speedup for the test program BAICER on
the cluster of dual Intel Xeon, 2.4 GHz, 4x-InfiniBand.

As a result, the concept of memory affinity has to be introduced:
memory may be situated at different "distances" from the processor.
On an SGI Altix, the ratio of remote to local memory access times
varies from 1.9 to 3.5, depending on the relative locations of the
processor and the memory.

Usually this is not a problem, since nearly all CPU architectures use
a cache to exploit locality of reference in memory accesses. Because
nearly everything is in a cache, one may often safely ignore problems
resulting from differences in memory affinity. This is not so in the
case of FORM, as discussed below.

As a consequence of the frequent use of the cache just described, the
new problem of cache coherency arises: if one of the processors
modifies some piece of data (i.e., performs a "write" operation),
then the other processors have access only to an out-of-date copy of
these data stored in their cache. Normally the cached data have to be
invalidated. Usually, NUMA computers use special-purpose hardware to
maintain cache coherence. Such systems are called "cache-coherent
NUMA", or ccNUMA. The worst case for such an approach is mutual cache
invalidation, when two (or more) processors write to the same memory
region.

Let us now consider how ccNUMA could be used for a well-parallelizable
problem such as the multiplication of two matrices.

Let us take the simplest algorithm, which is well suited for a
multithreaded process: each thread reads one row of the first matrix
and one column of the second matrix, and sums up the products in a
local (or even register) variable. In this case the only instruction
acting on the shared memory is "read from memory". Practically all
arithmetic operations are performed on the local memory, and only at
the end is the result written into the global (shared) memory.

Figure 4. Possible multithreaded approach of FORM parallelization.
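The memory access pattern of the matrix multiplication example discussed above can be sketched as follows. This is an illustration only: the function name and the use of a thread pool are assumptions, and in CPython the global interpreter lock prevents a real speedup, so the code shows which accesses are reads of shared data, which are local, and where the single shared write happens.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply(a, b, n_threads=4):
    """Multiply matrices a and b, one task per element of the result."""
    n_rows, n_inner, n_cols = len(a), len(b), len(b[0])
    result = [[0] * n_cols for _ in range(n_rows)]  # shared output

    def cell(index):
        i, j = divmod(index, n_cols)
        acc = 0  # local accumulator, private to this task
        for k in range(n_inner):
            acc += a[i][k] * b[k][j]  # shared inputs are only read
        result[i][j] = acc  # the single write to shared memory

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(cell, range(n_rows * n_cols)))  # wait for all tasks
    return result
```

Because each task writes exactly one element, and no two tasks write the same element, threads never modify each other's data.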

Unfortunately, the structure of FORM is quite different. Indeed, let
us suppose that each thread treats one term (Fig. 4). The thread then
produces a lot of new terms, which have to be stored in the shared
memory. This would lead to permanent cache invalidation. It indicates
that the internal structure of FORM is not well suited for
multi-threaded parallelization on a ccNUMA architecture.

Figure 5. Master-Slave approach on SMP without MPI.

Alternatively, one can exploit the multi-process Master-Slave model
discussed above, but now of course without MPI. For communication
between the master and the slaves we can use shared memory,
allocating the shared memory buffers "close" to each slave (Fig. 5).
We refer to this model as "Shared Memory" (SM). As before, the master
splits the data into chunks and distributes them among the slaves by
placing the data into the shared memory buffers. The slaves
manipulate these data in their local memory, and the (pre-sorted)
results are collected by the master. Here we have explicit control
over memory affinity, and the message passing bottleneck is gone.
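The shared-memory Master-Slave scheme described above can be illustrated with the standard `multiprocessing.shared_memory` module (Python 3.8 or later). This is a sketch under strong assumptions and not the ParFORM-SM implementation: terms are stand-in 64-bit integers, the only "work" is sorting, all helper names are hypothetical, and nothing here controls on which NUMA node a buffer is physically placed.

```python
import heapq
import struct
from multiprocessing import Process, shared_memory


def write_terms(segment, terms):
    """Store integers as consecutive 64-bit values at the buffer start."""
    struct.pack_into(f"{len(terms)}q", segment.buf, 0, *terms)


def read_terms(segment, count):
    """Copy `count` 64-bit integers out of the shared buffer."""
    return list(struct.unpack_from(f"{count}q", segment.buf, 0))


def slave(name, count):
    """Attach to one buffer, sort a local copy, write the result back."""
    segment = shared_memory.SharedMemory(name=name)
    try:
        local = read_terms(segment, count)  # work on local memory
        local.sort()
        write_terms(segment, local)  # hand back the pre-sorted stream
    finally:
        segment.close()


def master(terms, n_slaves):
    """Split the data, one buffer and one process per slave, then merge."""
    chunks = [terms[i::n_slaves] for i in range(n_slaves)]
    segments = []
    for chunk in chunks:
        size = max(8, 8 * len(chunk))  # a segment must not be empty
        segment = shared_memory.SharedMemory(create=True, size=size)
        write_terms(segment, chunk)
        segments.append(segment)
    workers = [Process(target=slave, args=(s.name, len(c)))
               for s, c in zip(segments, chunks)]
    try:
        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()
        streams = [read_terms(s, len(c)) for s, c in zip(segments, chunks)]
        return list(heapq.merge(*streams))  # final sort done by the master
    finally:
        for segment in segments:
            segment.close()
            segment.unlink()
```

Slaves never talk to each other; each one sees only its own buffer. When run as a script, `master([5, 3, 9, 1, 7, 2], n_slaves=2)` should be called under an `if __name__ == "__main__":` guard so that it also works with process start methods that re-import the main module.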

The Master-Slave model makes it possible to optimize the
communication between the slaves and the master. For example, no
direct communication between slaves is allowed (footnote 5), so a lot
of low-level optimization becomes available once the structure is
restricted to this communication topology.

3. First results 

We implemented the ideas described in the previous section in
ParFORM (we call this version ParFORM-SM) and present first results
in the following.

Fig. 6 shows the comparison with the earlier MPI-based results. We
can immediately see a performance improvement of about 20% compared
with the previous MPI version. The most important observation,
however, is that the communication overhead is now almost negligible,
as can be seen by comparing the first two data points in Fig. 6.
5 MPI has a "peer-to-peer" structure, which is much more
complicated.



Figure 6. Results of running the test program on 
SGI Altix 3700 with MPI-based communications 
(MPI) and with Shared Memory segments (SM) 
normalized to the sequential version. 



Here we normalize the results not to the two-processor time, but to
the time spent by the corresponding sequential version of the
program. Looking at the difference in times between one processor
(sequential program) and two processors for the MPI variant (solid
line), we see a performance reduction of about 20%. The reason is the
communication overhead. Indeed, in two-processor mode the single
slave does almost all of the work (except the final sorting), and the
program spends some extra time on communication between the master
and the slave.

For the shared-memory-based program, the difference between the
one- and two-processor regimes is not observable. This indicates that
the communication overhead has no real significance in the SM model.

As the number of processors increases, another bottleneck arises: the
time for the final sorting becomes more and more significant. Since
this sorting is performed only by the master, all the slaves are idle
during this stage. This explains the speedup saturation around 30
nodes for both the MPI and the SM approaches.
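The effect of a final sorting stage that only the master performs can be made quantitative with a simple Amdahl-style estimate. The model below is an assumption introduced for illustration, not a fit to the measurements: the work done by the slaves is taken to scale perfectly with their number, the final sorting time is taken to be independent of it, and communication is neglected.

```python
def model_speedup(n_slaves, parallel_time, sort_time):
    """Speedup over the sequential run in a simple two-stage model.

    parallel_time: sequential time of the part shared among the slaves
    sort_time:     time of the final sorting done by the master alone
    """
    sequential = parallel_time + sort_time
    return sequential / (parallel_time / n_slaves + sort_time)


def saturation_limit(parallel_time, sort_time):
    """Upper bound of model_speedup for an unlimited number of slaves."""
    return (parallel_time + sort_time) / sort_time
```

In this model the speedup can never exceed `saturation_limit`, however many slaves are added; for instance, if the final sorting took one tenth of the sequential run, the bound would be 10.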



4. Outlook 

We briefly discuss the various aspects of the models and
architectures described above.

On ccNUMA computers, instead of MPI, we should use the multi-process
model (see Sect. 2) with multiple shared memory segments. The
corresponding shared memory approach was developed and tested, and it
demonstrates a stable performance improvement of around 20%. The
communication overhead is negligible and the main bottleneck is the
final sorting stage, so it seems reasonable to parallelize the final
sorting process first in the future.

On clusters, there are no alternatives to MPI 
at the moment. 

In the present cluster version of ParFORM the communication overhead
is quite large, and thus it can only be used for a relatively small
number of nodes. In particular, it seems that it would be
advantageous if the master also participated in the real
calculations.

Colleagues who are interested in using the ParFORM software should
contact M. Tentyukov.